How to Use PyTorchVideo
It looked quite useful, so here are some notes.
Using Pretrained Models
1. Loading a Model (via torch hub)
There is a dedicated hubconf.py on the master branch, so specify the path to it and load the model by name as a string.
import torch

model_name = "slow_r50"
path = 'path/to/directory/of/hubconf.py'
model = torch.hub.load(path, source="local",
                       model=model_name, pretrained=True)
Note that the loaded model has the type pytorchvideo.models.net.Net, so when integrating it into Lightning, store it as an attribute of your LightningModule.
2. Prepare Transforms to Convert Input Video to the Required Format
Match the video to the model's specifications. slow_r50 expects 256x256 input with normalized RGB. The number of frames per input also varies by model, so you need to specify it in advance.
For details, see here.
from pytorchvideo.transforms import (
    ApplyTransformToKey,
    ShortSideScale,
    UniformTemporalSubsample
)
from torchvision.transforms import Compose, Lambda
from torchvision.transforms._transforms_video import (
    CenterCropVideo,
    NormalizeVideo
)
side_size = 256
mean = [0.45, 0.45, 0.45]
std = [0.225, 0.225, 0.225]
crop_size = 256
num_frames = 8
sampling_rate = 8
frames_per_second = 30
clip_duration = (num_frames * sampling_rate)/frames_per_second
transform = ApplyTransformToKey(
    key="video",
    transform=Compose(
        [
            UniformTemporalSubsample(num_frames),
            Lambda(lambda x: x / 255.0),
            NormalizeVideo(mean, std),
            ShortSideScale(size=side_size),
            CenterCropVideo(crop_size=(crop_size, crop_size))
        ]
    ),
)
3. Encoding the Video
If you have a video file ready, encoding can also be handled for you. .avi files work as well.
The steps are:
- Encode the video
- Clip by specifying seconds
- Pass through the transform
from pytorchvideo.data.encoded_video import EncodedVideo

sample_path = 'sample.avi'
# 1. Encode the video
video = EncodedVideo.from_path(sample_path)
# 2. Clip by specifying seconds
video_clip = video.get_clip(start_sec=0, end_sec=10)
# 3. Pass through the transform
video_data = transform(video_clip)
4. Feeding Input to the Model
The transformed video is a dictionary: the "video" key gives you the video tensor of shape (C, T, H, W), "audio" gives you the audio, and "video_name" gives you the original path.
The model takes a tensor of shape (batch_size, C, T, H, W), so:
inputs = video_data['video']
prediction = model(inputs.unsqueeze(0))
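The raw output is a vector of class logits; slow_r50 is trained on Kinetics-400, so there are 400 of them. A sketch of typical post-processing, using fake logits in place of the real prediction above:

```python
import torch

# Fake (1, 400) logits standing in for the real `prediction` above
prediction = torch.randn(1, 400)
# Convert logits to probabilities, then take the 5 most likely classes
probs = torch.nn.functional.softmax(prediction, dim=1)
top5 = probs.topk(k=5)
print(top5.indices)  # ids of the top-5 predicted Kinetics classes
```

The indices can then be mapped back to human-readable labels with a Kinetics-400 class-name file.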
Bonus: Integrating with Lightning
There are two things to do:
- Integrate the model
- Write a DataModule (Transform + label assignment)
import pytorch_lightning
import torch
import torch.nn.functional as F

class VideoClassification(pytorch_lightning.LightningModule):
    def __init__(self, path, model_name="slow_r50"):
        super().__init__()
        self.model = torch.hub.load(path, source="local",
                                    model=model_name, pretrained=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-1)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        y = self.model(batch["video"])
        t = batch["label"]
        loss = F.cross_entropy(y, t)
        return loss
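The loss in training_step is plain cross-entropy over (batch, class) logits. A standalone sketch with fake data, assuming the slow_r50 / Kinetics-400 shapes from earlier:

```python
import torch
import torch.nn.functional as F

# Fake batch: logits for 4 clips over 400 classes, plus integer class labels
y = torch.randn(4, 400)
t = torch.randint(0, 400, (4,))
loss = F.cross_entropy(y, t)
print(loss)  # scalar tensor, which Lightning backpropagates
```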
from pytorchvideo.transforms import (
ApplyTransformToKey,
RandomShortSideScale,
RemoveKey,
ShortSideScale,
UniformTemporalSubsample
)
from torchvision.transforms import (
    Compose,
    Lambda,
    Normalize,
    RandomCrop,
    RandomHorizontalFlip
)

import os
import pytorchvideo.data
import torch

class KineticsDataModule(pytorch_lightning.LightningDataModule):
    def setup(self, stage=None):
        train_transform = Compose(
            [
                ApplyTransformToKey(
                    key="video",
                    transform=Compose(
                        [
                            UniformTemporalSubsample(8),
                            Lambda(lambda x: x / 255.0),
                            Normalize((0.45, 0.45, 0.45), (0.225, 0.225, 0.225)),
                            RandomShortSideScale(min_size=256, max_size=320),
                            RandomCrop(244),
                            RandomHorizontalFlip(p=0.5),
                        ]
                    ),
                ),
            ]
        )
        self.train_dataset = pytorchvideo.data.Kinetics(
            data_path=os.path.join(self._DATA_PATH, "train.csv"),
            clip_sampler=pytorchvideo.data.make_clip_sampler("random", self._CLIP_DURATION),
            transform=train_transform
        )

    def train_dataloader(self):
        return torch.utils.data.DataLoader(
            self.train_dataset,
            batch_size=self._BATCH_SIZE,
            num_workers=self._NUM_WORKERS,
        )